Large-Scale Neighbor-Joining with NINJA
Abstract
Neighbor-joining is a well-established hierarchical clustering algorithm for inferring phylogenies. It begins with observed distances between pairs of sequences, and the clustering order depends on a metric derived from those distances. The canonical algorithm requires O(n³) time and O(n²) space for n sequences, which precludes application to very large sequence families, e.g. those containing 100,000 sequences. Datasets of this size are available today, and such phylogenies will play an increasingly important role in comparative genomics studies. Recent algorithmic advances have greatly sped up neighbor-joining for inputs of thousands of sequences, but are limited to fewer than 13,000 sequences on a system with 4 GB of RAM. In this paper, I describe an algorithm that speeds up neighbor-joining by dramatically reducing the number of distance values that are examined in each iteration of the clustering procedure, while still computing a correct neighbor-joining tree. This algorithm can scale to inputs larger than 100,000 sequences because it uses external-memory-efficient data structures. A free implementation may be obtained from http://nimbletwist.com/software/ninja.
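For orientation, the canonical procedure that the abstract takes as its baseline can be sketched as follows. This is a minimal illustration of textbook neighbor-joining, not NINJA's filtering scheme or its external-memory data structures; the function name `neighbor_joining`, the internal labels `node0`, `node1`, ..., and the five-taxon example matrix are assumptions introduced here for illustration.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Textbook neighbor-joining on a symmetric distance matrix.

    names: leaf labels; dist: square list of lists with dist[i][j] the
    distance between names[i] and names[j].
    Returns the edges of an unrooted tree as (node, node, length) tuples.
    """
    # Working distances between currently active nodes (quadratic space).
    d = {a: {b: float(dist[i][j]) for j, b in enumerate(names) if j != i}
         for i, a in enumerate(names)}
    edges = []
    next_id = 0  # counter for hypothetical internal-node labels
    while len(d) > 2:
        n = len(d)
        # r[a] is the sum of distances from a to every other active node.
        r = {a: sum(row.values()) for a, row in d.items()}
        # Scan all pairs for the minimum of
        # Q(a, b) = (n - 2) * d(a, b) - r[a] - r[b].
        # This full scan in each iteration is what makes the method cubic.
        a, b = min(combinations(d, 2),
                   key=lambda p: (n - 2) * d[p[0]][p[1]] - r[p[0]] - r[p[1]])
        dab = d[a][b]
        # Branch lengths from a and b to their new parent u.
        len_a = 0.5 * dab + (r[a] - r[b]) / (2 * (n - 2))
        len_b = dab - len_a
        u = f"node{next_id}"
        next_id += 1
        edges.append((a, u, len_a))
        edges.append((b, u, len_b))
        # Distances from u to every remaining node.
        new_row = {k: 0.5 * (d[a][k] + d[b][k] - dab)
                   for k in d if k not in (a, b)}
        del d[a], d[b]
        for k, row in d.items():
            del row[a], row[b]
            row[u] = new_row[k]
        d[u] = dict(new_row)
    # Join the last two active nodes with a single edge.
    a, b = d
    edges.append((a, b, d[a][b]))
    return edges


# Small additive example matrix (illustrative data, not from the paper).
names = ["a", "b", "c", "d", "e"]
dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
for u, v, length in neighbor_joining(names, dist):
    print(u, v, length)
```

Each of the n - 2 iterations scans every remaining pair, which is where the cubic running time comes from, and the full pairwise table accounts for the quadratic space. The algorithm described in the abstract returns the same tree while examining far fewer of these values per iteration.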
Related resources
Efficient Large-Scale Phylogeny Reconstruction
In this study we introduce two novel distance-based algorithms with provably high computational and statistical efficiency. Furthermore, we report the results of experiments simulating sequence evolution on large trees with 135, 500, and 1895 leaves showing high success rates of our algorithms for large mutation probabilities, and high success rates of the popular Neighbor-Joining algorithm for...
Why Neighbor-Joining Works
We show that the neighbor-joining algorithm is a robust quartet method for constructing trees from distances. This leads to a new performance guarantee that contains Atteson’s optimal radius bound as a special case and explains many cases where neighbor-joining is successful even when Atteson’s criterion is not satisfied. We also provide a proof for Atteson’s conjecture on the optimal edge radi...
FastJoin, an improved neighbor-joining algorithm.
Reconstructing the evolutionary history of a set of species is an elementary problem in biology, and methods for solving this problem are evaluated based on two characteristics: accuracy and efficiency. Neighbor-joining reconstructs phylogenetic trees by iteratively picking a pair of nodes to merge as a new node until only one node remains; due to its good accuracy and speed, it has been e...
Efficient Construction of Accurate Multiple Alignments and Large-Scale Phylogenies
A central focus of computational biology is to organize and make use of vast stores of molecular sequence data. Two of the most studied and fundamental problems in the field are sequence alignment and phylogeny inference. The problem of multiple sequence alignment is to take a set of DNA, RNA, or protein sequences and identify related segments of these sequences. Perhaps the most common use of ...
Neighbor Joining and Maximum Likelihood with RNA Sequences: Addressing the Interdependence of Sites
Intrastrand base pairings give ribosomal and other RNA molecules characteristic structures that are important for their function. In order to maintain these structures, a substitution at one paired site may have to be compensated for by an appropriate substitution at the complementary site. Thus paired sites do not evolve independently of one another. Most current methods for inferring phylogen...